Background: The differential blood cell count provides valuable information about a person's 
state of health. Together with a variety of biochemical variables, these analyses describe impor- 
tant physiological and pathophysiological relations. There is a need for research databases to 
explore associations between these parameters, concurrent comorbidities, and future disease 
outcomes. 

Methods and results: The Copenhagen General Practitioners' Laboratory is the only labora- 
tory serving general practitioners in the Copenhagen area, covering approximately 1.2 million 
inhabitants. The Copenhagen General Practitioners' Laboratory has registered all analytical 
results since July 1, 2000. The Copenhagen Primary Care Differential Count database contains 
all differential blood cell count results (n=1,308,022) from July 1, 2000 to January 25, 2010 
requested by general practitioners, along with results from analysis of various other blood 
components. This data set is merged with detailed data at a person level from The Danish Cancer 
Registry, The Danish National Patient Register, The Danish Civil Registration System, and The 
Danish Register of Causes of Death. 

Conclusion: This paper reviews methodological issues behind the construction of the 
Copenhagen Primary Care Differential Count database as well as the distribution of character- 
istics of the population it covers and the variables that are recorded. Finally, it gives examples 
of its use as an inspiration to peers for collaboration. 

Keywords: differential leukocyte count, research, nationwide health registers 

Introduction 

One of the most common blood tests in the world, the differential blood cell count (DIFF), 
provides valuable information on the relative percentage of each type of white blood 
cell in the peripheral blood. It also provides data on the occurrence of abnormal white 
blood cell populations like leukemic blast cells, immature myeloid cells, and circulating 
lymphoma cells. Together with the hemoglobin and platelet count, the DIFF constitutes 
the complete blood count (CBC), which supplies important information about a person's 
state of health. Blood sampling may be done in all corners of the health care sector and 
takes only minutes to perform. Anticoagulants in sample media allow storage of samples 
for hours, even days if properly cooled. 1 Accordingly, the DIFF and CBC are used for 
a broad range of medical indications in diagnosis and monitoring of disease activity. 
Together with a variety of other biochemical parameters, it is also possible to monitor 
medical therapy as well as to establish prognostic indexes for a plethora of diseases. 

However, there is a need for research databases to explore associations between 
these cellular and biochemical variables, prior and/or concurrent comorbidities, and 
disease outcomes. Such databases offer unique opportunities
to follow long-term outcomes of well-characterized 
individuals. In 2008, researchers from the Research Unit 
for General Practice in Copenhagen began constructing the 
Copenhagen Primary Care Differential Count (CopDiff) 
database in order to meet this demand. 

The CopDiff database extends the purpose of other impor- 
tant research databases such as 1) the Copenhagen General 
Population Study, 2 since the CopDiff database encompasses 
both the young (<18 years) and the old (>80 years), and 
2) the Clinical Laboratory Information System research 
database, 3 since the CopDiff supplies data from primary care. 
By linkage to nationwide health registers, the CopDiff will 
have the capability to assess the prognostic value of the DIFFs 
for several clinical outcomes, while adjusting for certain 
potential confounders. Access to data on some 550,000 indi- 
viduals (constituting some 10% of the Danish population) 
over a 10-year period enables the CopDiff database to assess 
both common and rare disease outcomes. A regular update 
will allow for the extension of the database into the future 
with still more outcomes/events. 

The purpose of this paper is to review the content of 
the CopDiff database, describe basic analytical approaches 
(biochemical and statistical), and lastly, to encourage col- 
laboration with fellow scientists. 

Materials and methods 

Construction and content 

The cellular and biochemical variables 
of the CopDiff database 

In the Copenhagen area, with its approximately 1.2 million 
inhabitants, there is only one laboratory serving general 
practitioners (GPs), the Copenhagen General Practitioners' 
Laboratory (CGPL), known as the Elective Laboratory 
of the Capital Region since January 1, 2013. The CGPL 
was founded in 1922 and serves a total of 739 GPs in 
567 practices (2010) with a broad range of blood tests, 
clinical physiological tests, and various cardiac tests. The 
CGPL has International Organization for Standardization 
15189 accreditation and has saved all values on the analyses 
it performs since July 2000. The CGPL offers two routine 
groups of hematology analyses for the GPs: 

1. "HEM": hemoglobin, mean red cell volume, red cell distribution
width, total white blood cell count, and platelet 
count. 

2. "CBC": the HEM group plus differential counts of 
white blood cells (neutrophils, lymphocytes, monocytes, 
eosinophils, and basophils). 



The individual components of the groups cannot be 
requested alone. To obtain, for example, hemoglobin, the 
GP has to request either "HEM" or "CBC". 

The CBC requests from the period July 1, 2000 until 
January 25, 2010 were included in the first step of building the 
CopDiff database (Table 1). The stand-alone "HEM" requests 
were excluded. All other analyses requested by the GP in 
addition to the CBC, if on the same requisition, were also 
included in the database (Table 2). Hence, common for all 
individuals was the existence of a CBC estimation while the 
remaining analyses were only included for a particular patient 
if the GP had ordered these analyses on the same requisition 
on which the CBC was ordered. Requests for CBCs from 
non-GPs (ie, specialized consultants with their own practices) 
were excluded in order to obtain a pure primary care resource 
(Figure 1). Of note, these requisitions have not been deleted 
from the CopDiff servers and may be analyzed if it becomes 
relevant to include them in an analysis. 
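The inclusion rules described above can be sketched as a simple filter. This is a minimal illustration only: the record layout, the field names, and the treatment of both boundary dates as inclusive are assumptions made for the example and do not reflect the actual CGPL data model.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical record layout; field names are illustrative, not the CGPL schema.
@dataclass
class Requisition:
    person_id: str
    requested_on: date
    panel: str           # "CBC" or "HEM"
    requester_type: str  # "GP" or "non-GP"


STUDY_START = date(2000, 7, 1)
STUDY_END = date(2010, 1, 25)


def is_included(req: Requisition) -> bool:
    """Keep CBC requests made by GPs within the study period.

    Assumption: both boundary dates are treated as inclusive.
    """
    in_period = STUDY_START <= req.requested_on <= STUDY_END
    return in_period and req.panel == "CBC" and req.requester_type == "GP"


requisitions = [
    Requisition("p1", date(2005, 3, 1), "CBC", "GP"),      # kept
    Requisition("p2", date(2005, 3, 1), "HEM", "GP"),      # stand-alone HEM
    Requisition("p3", date(2005, 3, 1), "CBC", "non-GP"),  # non-GP requester
    Requisition("p4", date(2010, 2, 1), "CBC", "GP"),      # after the period
]
included = [r for r in requisitions if is_included(r)]
```

Because excluded requisitions are filtered rather than deleted, the same predicate could be relaxed later, for example to bring non-GP requisitions back into an analysis.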

The CopDiff database eventually included 1,308,022 
requisitions on 555,039 unique individuals to be further 
merged with data from nationwide registers described below. 
All requisitions with numeric and alphanumeric (but valid) 
results were also categorized according to reference limits 
at the time of the analysis as either normal, below, or above 
reference range in separate variables (Table 2). 
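As an illustration of the categorization described above, the following sketch assigns a result to "below", "normal", or "above" relative to the reference limits valid at the time of analysis. The function name, the numeric limits, and the choice to count values equal to a limit as normal are assumptions for the example, not CGPL conventions.

```python
def classify_result(value: float, lower: float, upper: float) -> str:
    """Categorize a numeric result against its reference interval.

    Assumption: values equal to a limit are counted as "normal".
    """
    if lower > upper:
        raise ValueError("lower reference limit exceeds upper limit")
    if value < lower:
        return "below"
    if value > upper:
        return "above"
    return "normal"


# Illustrative limits only; they are not the laboratory's reference limits.
categories = [classify_result(v, lower=4.0, upper=10.0) for v in (3.5, 7.2, 12.1)]
```

Storing the category in a separate variable, as the database does, keeps the original numeric value available when reference limits change over time.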

Analytical methods of the CopDiff database 

All CBC samples were analyzed on Siemens (Bayer/ 
Technicon, Munich, Germany) hematology systems. CGPL 
used three similar types of these instruments in the period 
2000-2010, which in chronological order were Technicon® 
H3 RTX (used between 2000 and 2002), ADVIA® 120 (used 
between 2002 and 2010), and ADVIA® 2120i (used together 
with ADVIA® 120 from 2009 to 2010). The basic chemical 
and physical methods are identical among these systems. 
In general, samples were treated with certain chemicals 

Table 1 Characteristics of the CopDiff database population

Characteristic                                            Value
Sex, n (%)
  Male                                                    232,251 (41.8)
  Female                                                  322,788 (58.2)
Age at first requisition, years                           46.9±21.5
Requisitions, total                                       1,308,022
Requisitions per patient                                  2.36±2.81
Deaths before January 25, 2010                            61,416 (11.1)
Emigrated/disappeared/inactive before January 25, 2010    10,669 (1.9)
Years from first requisitions until January 25, 2010
  or death/emigration/inactivation                        4.98±2.87

Note: Values are numbers (%) or means (SD).

Abbreviations: CopDiff, Copenhagen Primary Care Differential Count; SD, standard deviation.

All requisitions of a DIFF between 2000 and 2010:
1,579,524 requisitions from 621,547 individuals

  Excluded (6,512 requisitions)
  • Nonnumeric social security numbers
  • Non-CRS-validated social security numbers

1,573,012 requisitions from 616,540 individuals

  Excluded (264,990 requisitions)
  • Requisitions from non-GPs

1,308,022 requisitions from 555,039 individuals

Nationwide registers

Analysis

Figure 1 Flowchart.

Abbreviations: CGPL, Copenhagen General Practitioners' Laboratory; CopDiff, Copenhagen Primary Care Differential Count; CRS, Danish Civil Registration System; DIFF, 
differential cell count; GP, general practitioner. 



inside the instruments and the absorbance of hemolysate and 
light scatter of individual cells were measured. Samples were 
subjected to microscopic (manual) differential cell counting 
of leukocyte types if flagged for this during the initial automated
differential counting. For this, we used polychrome 
methylene blue and eosin stains. Method principles valid 
for all three systems, based on the ADVIA® 120 system, are 
accessible in the Supplementary materials (Appendix 1 and 
Table S1). When switching from H3 RTX to ADVIA® 120, 
there was a relative drop of 5% in "red cell distribution width" 
analyses. Reference intervals were updated accordingly. No 
other changes in hematological analyses in the CopDiff period 



(2000-2010) were performed. The properties defined by the 
Nomenclature, Properties, and Units codes for alkaline phos- 
phatase and lactate dehydrogenase, as noted in Table 2, were 
introduced March 11, 2004. Before that, alkaline phosphatase 
and lactate dehydrogenase were measured with methods that 
gave higher results. These higher results are also included in 
the CopDiff database. 

Danish nationwide registers on health 
and social status 

Denmark has a long tradition of collecting miscellaneous 
information on disease incidence, social relations, and 
other data describing its population. Permanent residents in 
Denmark are provided with a personal identification num- 
ber, which functions as a cornerstone in efficient linkage 
between all registers containing information at an individual 
level. Existing registers on health issues encompass, among 
other information, data on the use of primary and secondary 
health care as well as diagnoses from contacts with hospi- 
tals, including psychiatric hospitals, benign and malignant 
conditions, and the prescription of drugs in primary health 
care. Furthermore, available registers on social issues con- 
tain data on education, living conditions, labor, earnings, 
income, etc. 4 Researchers employed at authorized research 
institutions in Denmark can obtain access to individual level 
data which enables the Research Unit for General Practice 
in Copenhagen to link paraclinical data from the CGPL to 
nationwide registers. A comprehensive list of existing reg- 
isters, including detailed information on structure, access, 
legislation, and archiving of Danish registers on health and 
social issues, has been reviewed recently. 4 Also, a thorough 
description of the most important Danish population-based 
registers for public health and health-related welfare research 
has been published. 5 

In April 2011 (and again in November 2013), the CopDiff 
database linked all 555,039 individuals to 1) the Danish Civil 
Registration System (CRS); 2) the Danish Cancer Registry, 
containing data on all malignancies in Denmark since 1942 
and to which reporting is mandatory; 6 3) the Danish National 
Patient Register including information on all contacts with 
hospitals in Denmark, inclusive of discharge diagnoses, out- 
patient clinic contacts, and surgical procedures performed; 7 
and 4) the Register of Causes of Death, containing informa- 
tion on causes of death based upon death certificates. 8 

Results 

Utility 

The general type of research question that can be answered 
by the CopDiff database is whether certain levels of selected 
blood components are associated with an increased risk of 
certain future disease outcomes. Given that the CopDiff 
database was constructed on the basis of existing DIFFs, 
any researcher with a hypothesis not directly involving 
leukocytes as main variables of interest should bear in mind 
the risk of inappropriately excluding potentially relevant 
individuals if such cases have not been referred to DIFF 
sampling by their GPs. Notably, DIFF sampling performed 
or requested in secondary care is not included, and the 
CopDiff database therefore does not contain hospitalized 



individuals. Furthermore, due to inclusion/selection of 
individuals who have been referred to DIFF sampling, it 
may be assumed that the CopDiff cohort has more morbidity 
than the (nonhospitalized) background population. Statistical 
methods implemented to analyze these data have to take 
this selection bias into account. Another challenge is the 
assessment of outcomes in the presence of competing risks. 
Particular leukocyte configurations may increase mortality, 
reducing the time for certain diseases of interest to develop, 
and thereby artificially reduce the risk for such diseases. 
In the Supplementary materials (Appendix 2), we give a 
portfolio of statistical analysis designs and their advantages 
and disadvantages and illustrate their use with data from an 
already published study. 

Discussion 

By constructing the CopDiff database, we believe a novel 
opportunity has been created to explore associations between 
DIFFs from the peripheral blood and biochemical parameters, 
concurrent comorbidities, and disease outcomes. An impor- 
tant limitation in the construction of the database is the way 
the individuals were selected in general practice, with a wide 
variety of unknown clinical problems, and individuals without 
existing DIFFs were not included. Nevertheless, by encom- 
passing the young (<18 years) and the old (>80 years), the 
CopDiff database allows for the assessment of associations 
through a lifetime for a large primary care population. The 
access to all DIFFs from all GPs in the Copenhagen area 
over a 10-year period offers unique insight into the entire 
Copenhagen area, covering approximately 1.2 million inhabitants.
Since the CopDiff population was sampled continuously 
without any restrictions as to why the DIFF was requested by 
the GP, the risk of selection bias is diminished among these 
individuals. In time, the merging of the CopDiff database 
with other population-based registers such as the Danish 
Drug Prescription Register, the Danish Heart Register, the 
National Diabetes Register, and the Danish Psychiatric Central 
Research Register will allow for exploration of new areas of 
research. The possible combined assessment of individuals 
from the general population (Copenhagen General Population 
Study), primary care (CopDiff), and secondary care (Clinical 
Laboratory Information System) will provide the basis for 
unique insight into patient journeys. 

Conclusion 

This paper has given insight into the fundamentals of the 
CopDiff database, described its content, and by giving 
examples of statistical analytical approaches, hopefully 
inspired researchers to develop possible future uses. We 
hereby encourage our peers to contact us in order to col- 
laborate on new projects or to test hypotheses in the CopDiff 
cohort. 

Further information on the CopDiff database, steering 
committee, bylaws, and ways to collaborate can be found by 
visiting the CopDiff homepage (http://almenpraksis.ku.dk/english/research/copdiff).
Supplementary materials 

Appendix 1 

CBC, reticulocyte, and white blood cell differential 
analyses using the ADVIA® 120 Hematology System 

The ADVIA® 120 was used from 2002 to 2010, and the 
method principles for this instrument are valid for all three 
instrument types (Technicon® H3 RTX, ADVIA® 120 and 
ADVIA® 2120i). We used the cyanide hemoglobin method 
throughout the period (also on ADVIA® 2120i). Basically, the 
samples were treated with chemicals in the instruments, and 
the absorbance of hemolysate and light scatter of individual 
cells were measured. 1 

Principle of the test 

The ADVIA® 120 Hematology System is a fully automated 
diagnostic instrument that uses cytochemical reactions to 
differentiate and count white blood cells, red blood cells, 
platelets, and reticulocytes. There are two main components 
of the system: the analyzer and the personal computer. 
In the analyzer, blood samples are aspirated and divided 
into aliquots for the different types of tests. Reagents and 
segmented samples are delivered to reaction chambers where 
they are mixed, and a cytochemical reaction takes place. 
Once the reactions are complete, the sample and reagent 
mixtures from the so-called "peroxidase", "red blood cell", 
"basophil", and "reticulocyte" reaction chambers are sent to 
the flowcells for analysis. The hemoglobin measurement is 
read in the hemoglobin reaction chamber that serves as an 
optical cuvette. After analysis, the sample and reagent mixture
is evacuated into the waste container and the appropriate 
pathways and reaction chambers are rinsed. Test results are 
sent to the computer to be reviewed and edited. 

The ADVIA® 120 Hematology System can run five selectivities:
CBC, CBC/DIFF, reticulocytes, CBC/reticulocytes, 
and CBC/DIFF/reticulocytes. The system has a throughput 
of 120 samples per hour when running CBC or CBC/DIFF 
and a throughput of 74 samples per hour when running the 
other selectivities. Up to 150 sample tubes can be loaded 
onto the barcoded racks of the autosampler. Single or STAT 
(short turn around time) samples can be tested on the manual 
samplers (Table S1). 

Appendix 2 

Design considerations and case study: eosinophilia 
in routine blood samples and the subsequent risk 
of hematological malignancies 2 

The nature of the data in the CopDiff database - the way the 
data are obtained, the dynamics of the capture population, 



the sheer amount of data - requires careful considerations 
regarding the methods used to analyze relevant hypotheses. 
Two basic analytical approaches that can be considered are 
the case-control design and the cohort design. 

Case-control designs 

In a classic case-control design, we take the outcome of inter- 
est as our point of departure, and cases are the individuals 
who experience this outcome. Exposure for the cases is 
then determined by the latest measurement in the CopDiff 
database within a fixed period before the (first) occurrence 
of the outcome. For each case, controls have to be chosen 
from all individuals who do not have the outcome at the time 
the case's outcome occurs. Choosing all controls for each 
case will cause controls to feature in the data multiple times. 
We need to control for this feature in the analysis or in some 
clever linkage of controls to cases. Choosing a limited num- 
ber of controls, possibly matched on some characteristics of 
the case, will reduce the data and reduce, but not solve, the 
multiplicity problem. Moreover, controls will feature in the 
data only when a measurement in the CopDiff database is 
within the fixed period before the corresponding case's occur- 
rence of the outcome. In conclusion, we find this approach 
too cumbersome and not suited to answer apparent research 
questions in the CopDiff database. 

Cohort designs 

Two other approaches start from the exposure and construct 
cohort data. We opt to select one measurement at random 
from the CopDiff database for each individual, so that we do 
not have to control for people who enter the cohort 
multiple times at different points in time. From the time of 
the exposure, we then look forward in the Danish national 
registers for the first occurrence of the outcome. Individuals 
for whom the outcome has already occurred at the time of 
the exposure measurement are excluded from the analysis. 
This information can be used in two ways: 
• The first approach uses logistic regression to model the 
probability of experiencing the outcome in a specified 
time period after the exposure. The main advantages of 
this methodology are 1) the outcome is well-understood 
and answers a clinically relevant question: "Will the risk 
of experiencing the outcome in the coming x-year period 
be higher if measurement y, taken now at the laboratory, 
is abnormal?" and 2) the effect estimate of the exposure 
is an odds ratio (OR), which is approximately the same as 
a relative risk and also invariant to the prevalence of the 
outcome. As mentioned previously, the CopDiff sample is 
Table S1 Chemical principles 

Test name: White blood cell count
Chemical principle: The whole blood sample is mixed with ADVIA® 120 BASO reagent that contains acid
and surfactant. The red cells are hemolyzed, and the white blood cells are then analyzed using two
angle laser light scatter signals.

Test name: Red blood cell/platelet count
Chemical principle: Both red blood cells and platelets are analyzed by a single optical cytometer
after appropriate dilution of the blood sample with ADVIA® 120 RBC/PLT reagent. The red blood cells
are isovolumetrically sphered and lightly fixed with glutaraldehyde to preserve the spherical shape.
Red cells and platelets are counted from the signals from a common detector with two different gain
settings. On the ADVIA® 120 Hematology System, the platelet signals are amplified considerably more
than the red blood cell signals. Coincidence correction is made to each of the counts so that
accurate counts are made over a wide range of each cell type.

Test name: Red blood cell/platelet size
Chemical principle: The method of sizing red cells and platelets uses the simultaneous measurement
of laser light scattered at two different angular intervals, which eliminates the adverse effect of
variation in cellular hemoglobin concentration on the determination of cell volume.

Test name: Hemoglobin concentration
Chemical principle: The hemoglobin method is a modification of the manual cyanmethemoglobin method
developed by the International Committee for Standardization in Hematology. The sample and ADVIA®
120 HGB reagent are mixed in the hemoglobin reaction chamber (colorimeter). The hemoglobin chemical
reactions consist of two steps: the red blood cells are lysed to release hemoglobin and the heme
iron in the hemoglobin is oxidized from the ferrous to the ferric state. It is then combined with
cyanide in the ADVIA® 120 HGB reagent to form the reaction product.

Test name: Reticulocyte cell count
Chemical principle: This method uses a nucleic acid dye (oxazine 750) to stain cellular RNA. Two
microliters of an EDTA anticoagulated whole-blood sample are mixed online with the ADVIA® 120
autoRETIC reagent. The ADVIA® 120 autoRETIC reagent isovolumetrically spheres the erythroid cells
and stains cellular RNA. Low-angle laser light scatter, high-angle laser light scatter, and
absorption characteristics of all cells are counted and measured. The absorption data are used to
classify each cell as a reticulocyte or mature red blood cell based on its RNA content.

Test name: Reticulocyte size
Chemical principle: The method of sizing reticulocytes uses the simultaneous measurement of laser
light scattered at two different angular intervals, which eliminates the adverse effect of variation
in cellular hemoglobin concentration on the determination of the mean reticulocyte volume parameter.

Test name: CHr
Chemical principle: The CHr is the mean of cellular hemoglobin content (CH) histogram for the
reticulocyte population.

Test name: Peroxidase method
Chemical principle: The peroxidase cytochemical reaction consists of two steps. In the first step,
EDTA anticoagulated whole-blood sample is diluted with ADVIA® 120 PEROX 1 reagent. Surfactants and
thermal stress cause lysis of the red blood cells. Formaldehyde in ADVIA® 120 PEROX 1 reagent fixes
the white blood cells. During the second step, ADVIA® 120 PEROX 2 reagent and ADVIA® 120 PEROX 3
reagent are added to the peroxidase reaction chamber. The 4-chloro-1-naphthol in ADVIA® 120 PEROX 2
reagent and the hydrogen peroxide in ADVIA® 120 PEROX 3 reagent stain the sites of peroxidase
activity in the granules of neutrophils, eosinophils, and monocytes. Lymphocytes, basophils, and
large unstained cells contain no granules with peroxidase enzyme activity.
A constant volume of the cell suspension from the peroxidase reaction chamber passes through the
flowcell. The two fluids flow as independent, concentric streams (no mixing), with the ADVIA® 120
PEROX SHEATH stream encasing the sample stream. The absorbance and the forward light-scattering
signatures of each blood cell are measured. The optical signals are converted to electrical pulses
by photodiodes. After processing, the information is displayed in two histograms. The Perox Y
histogram contains the forward-scattering data (cell size). The Perox X histogram contains the
absorption data (peroxidase staining). The two histograms are combined to form the Perox cytogram
from which cells are identified and counted.

Test name: Basophil/lobularity method
Chemical principle: When the EDTA anticoagulated whole blood sample is mixed with ADVIA® 120 BASO
reagent, the red blood cells are hemolyzed and the cytoplasm is stripped from all white cells except
basophils. The sample is then analyzed by two-angle laser light scattering detection using a laser
diode. The white cells are classified into three categories: basophils, mononuclear cells, and
polymorphonuclear cells.

Note: ADVIA® 120 BASO from Siemens (Bayer/Technicon, Munich, Germany). 
Abbreviation: EDTA, ethylenediaminetetraacetic acid. 



expected to have a higher morbidity than the background 
population because these people were referred for blood 
testing. However, the OR calculated from this sample can 
be transferred to the background population of all Danes 
and interpreted as a relative risk if, as is often the case, the 



outcome is rare. 3 The disadvantages of this approach are 
1) much of the information in the timing of the occurrence 
after the exposure is lost and 2) individuals who die or 
emigrate in the fixed time period after the exposure have an 
artificially low probability of experiencing the outcome. 
• The second approach uses survival analysis, eg, Cox 
proportional hazard regression, to model the time until 
first occurrence after exposure. The advantages of this 
approach are 1) various follow-up periods are allowed for 
instead of a single one and 2) death, emigration, and other 
reasons for differing follow-up periods are accounted for 
by censoring. The disadvantages are 1) the incidence rate 
ratio (or hazard ratio [HR]) that is the effect measure in 
a Cox regression is not invariant to the prevalence of the 
outcome. For this reason, the result, in principle, cannot 
be transferred to the Danish population in general, and 
2) the large amount of data in the CopDiff database will 
cause any test of the proportional hazard assumption to 
reject this with high probability. This will change the 
focus of the analysis toward investigating the develop- 
ment of the exposure effect over time. Since the timing 
of the exposure (blood sampling) is not a well-defined 
time point in the development of the disease, this time 
stratification seems inappropriate. 
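The cohort construction shared by both approaches (one randomly chosen measurement per individual, exclusion of individuals whose outcome predates that measurement, and a forward search for the first outcome) can be sketched as follows. This is a hedged illustration: the function name, the dictionary-based data layout, the fixed random seed, and the decision to treat an outcome on the same day as the measurement as already present are all assumptions made for the example.

```python
import random
from datetime import date
from typing import Dict, List, Optional, Tuple


def build_cohort(
    measurements: Dict[str, List[date]],
    first_outcome: Dict[str, date],
    seed: int = 0,
) -> Dict[str, Tuple[date, Optional[date]]]:
    """Return {person_id: (index_date, first_outcome_date_or_None)}.

    measurements:  all measurement dates per person (hypothetical layout)
    first_outcome: date of the first recorded outcome, absent if none
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    cohort: Dict[str, Tuple[date, Optional[date]]] = {}
    for person_id in sorted(measurements):
        # One randomly selected measurement defines the time of exposure.
        index_date = rng.choice(sorted(measurements[person_id]))
        outcome_date = first_outcome.get(person_id)
        # Assumption: an outcome on or before the index date counts as
        # already present, so the person is excluded.
        if outcome_date is not None and outcome_date <= index_date:
            continue
        cohort[person_id] = (index_date, outcome_date)
    return cohort


measurements = {
    "A": [date(2003, 5, 1), date(2006, 2, 10)],
    "B": [date(2004, 1, 15)],
    "C": [date(2002, 3, 3), date(2002, 9, 9)],
}
first_outcome = {"B": date(2001, 6, 1), "C": date(2008, 4, 4)}
cohort = build_cohort(measurements, first_outcome)
```

From such a structure, the logistic approach would derive a yes/no indicator for an outcome within a fixed window after the index date, whereas the survival approach would use the time from index date to outcome or censoring.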

We have a slight preference for the first cohort methodology
because of its epidemiological simplicity and straightforward
interpretation. A survival analysis may be more relevant only if
we can attach clinical significance to the timing of the exposure,
eg, if it is a diagnostic measurement of some sort. 

The two cohort design approaches are illustrated in the 
following data example. 

Illustrative example 

For this analysis we included all adults aged 18 to 
80 years from the CopDiff database. From each of these 
359,950 unique individuals, with at least one DIFF in the 
period January 1, 2001 to December 31, 2007, we randomly 
selected a single DIFF that contained an eosinophil count. 
These individuals were then categorized according to the 
degree of eosinophilia. Individuals with missing values 
for the eosinophil count (n=3,754) were excluded from the 
cohort. As a potential confounder, the level of C-reactive
protein (CRP), categorized as "increased" (≥10 mg/L) 
versus "normal" (<10 mg/L), was also obtained from the 
CopDiff database. A third category was defined for those 
individuals for whom CRP was not measured. We com- 
puted Charlson's comorbidity index 4 from the hospital 
contacts recorded in the Danish National Patient Register 
for the 3 years before the index DIFF. Furthermore, we 
recorded whether another DIFF was made during the 
6 months before the request and whether eosinophilia 
was present in this DIFF. The objective of the analysis 



was to investigate whether eosinophilia was associated 
with increased incidence of hematological malignancies 
(as recorded in the Danish Cancer Registry) in the period 
following the selected DIFF; in the following we illustrate 
the two approaches. Both analyses estimate the effects of 
eosinophilia adjusted for sex, age (quadratic), year, month, 
previous cancer, Charlson's comorbidity index, CRP, and 
previous eosinophilia. 

Analysis approach A: logistic regression 

The first approach analyzes the 3-year incidences of 
hematological malignancies in a multivariate logistic regres- 
sion model. The effects of eosinophilia were estimated with 
OR (95% confidence interval): 

• mild versus no eosinophilia: 1.36 (1.02-1.80) 

• moderate versus no eosinophilia: 3.41 (1.75-6.65) 

• severe versus no eosinophilia: 5.98 (3.03-11.78) 

These results clearly show a trend toward higher 
hematological malignancy incidence with higher degree 
of eosinophilia. However, in a parallel analysis a similar 
trend could be seen for mortality. Hence, the incidence of 
hematological malignancy may be artificially low for the 
more severe eosinophilia cases, which causes the effects to 
be less pronounced than they should have been. 
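As a reading aid for the estimates listed above, the following sketch checks whether each reported interval is symmetric around the point estimate on the log scale and backs out the implied standard error of the log OR. It assumes Wald-type 95% intervals computed on the log scale, which is common for logistic regression but is not stated in the source; the helper name and the multiplier 1.96 are part of that assumption.

```python
import math

# Reported OR and 95% confidence limits from approach A.
reported = {
    "mild": (1.36, 1.02, 1.80),
    "moderate": (3.41, 1.75, 6.65),
    "severe": (5.98, 3.03, 11.78),
}


def log_scale_summary(lower: float, upper: float, z: float = 1.96):
    """Return (geometric midpoint, implied SE of the log estimate).

    Assumes a Wald interval: exp(log(estimate) +/- z * SE).
    """
    midpoint = math.sqrt(lower * upper)
    standard_error = (math.log(upper) - math.log(lower)) / (2 * z)
    return midpoint, standard_error


summaries = {
    level: log_scale_summary(lower, upper)
    for level, (_, lower, upper) in reported.items()
}
```

Under this assumption the geometric midpoint of each interval lies close to the reported OR, and the implied standard error grows with the degree of eosinophilia, consistent with fewer individuals in the more severe categories.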

Analysis approach B: Cox proportional 
hazard regression 

The second approach analyzes the time to the first occurrence 
of hematological malignancy, or to death or end of follow-up, 
in a multivariate Cox proportional hazard regression model. 
The effects of eosinophilia were estimated with 
HR (95% confidence interval): 

• mild versus no eosinophilia: 1.38 (1.09-1.75) 

• moderate versus no eosinophilia: 3.11 (1.71-5.65) 

• severe versus no eosinophilia: 4.88 (2.61-9.14) 

This analysis also shows a clear trend: more severe 
eosinophilia is associated with higher incidence of 
hematological malignancies. Although the results from 
the two approaches are numerically quite similar, the two 
different effect measures are not comparable. However, if the 
event is rare and the proportional hazard assumption is true, 
the 3-year incidence will be similar to the hazard at 3 years, 
and the OR and HR will be numerically similar. 
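The numerical similarity of OR and HR for rare events can be illustrated with a small calculation. The sketch below assumes constant (exponential) hazards in both exposure groups and no competing events, so that proportional hazards hold by construction; the baseline hazards and the HR of 3 are illustrative values, not estimates from the CopDiff data.

```python
import math


def effect_measures(baseline_hazard: float, hazard_ratio: float, years: float = 3.0):
    """Return (unexposed risk, relative risk, odds ratio) at `years`.

    Assumption: constant hazards, so risk = 1 - exp(-hazard * years).
    """
    risk_unexposed = 1 - math.exp(-baseline_hazard * years)
    risk_exposed = 1 - math.exp(-baseline_hazard * hazard_ratio * years)
    relative_risk = risk_exposed / risk_unexposed
    odds_ratio = (risk_exposed / (1 - risk_exposed)) / (
        risk_unexposed / (1 - risk_unexposed)
    )
    return risk_unexposed, relative_risk, odds_ratio


# Rare outcome: hypothetical baseline hazard of 0.001 per person-year.
rare_risk, rare_rr, rare_or = effect_measures(0.001, hazard_ratio=3.0)

# Common outcome: hypothetical baseline hazard of 0.2 per person-year.
common_risk, common_rr, common_or = effect_measures(0.2, hazard_ratio=3.0)
```

In the rare scenario the relative risk and OR both stay close to the HR of 3, whereas in the common scenario the relative risk falls below and the OR rises above the HR, which is why the three measures should not be compared directly when the outcome is frequent.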

A problem with the second approach is the proportional 
hazard assumption. A statistical test for this assumption, 
eg, a likelihood ratio test for the addition of interactions of 
all covariates with log (time), will be overpowered. More- 
over, such a test was not possible in SAS PROC PHREG 
(SAS Institute Inc., Cary, NC, USA), as the estimation of 
a model including interactions with log (time) took too 
long to compute. A graphical test (as implemented in SAS 
PROC PHREG) produces for each covariate in the model 
a plot of the observed score process against several score 
processes simulated assuming proportional hazards. If the 
observed process is different from the simulated processes, 
the proportional hazard assumption is considered to be vio- 
lated. 5 For large databases such as CopDiff, this may take a 
long computing time if the event of interest is not rare. For 
the above analysis, two such plots are shown in Figure S1. 
Figure S1A indicates that in relation to the mild versus no 



eosinophilia effect, the proportional hazard assumption 
holds. However, in relation to the previous DIFF effect, 
incidence is higher than expected in the first years after the 
selected DIFF (Figure S1B). Similar patterns are seen for 
some other covariates. Violations of the proportional hazard assumption 
may be handled by either estimating the baseline hazard 
separately, in strata spanned by categories of the violating 
variables, or by splitting time up into separate periods for 
which separate effects are calculated. To the extent this is 
possible given computation times, this will blur the inter- 
pretation of the effects of eosinophilia on hematological 
malignancies. 